155 research outputs found
Testing for polytomies in phylogenetic species trees using quartet frequencies
Phylogenetic species trees typically represent the speciation history as a
bifurcating tree. Speciation events that simultaneously create more than two
descendants, thereby creating polytomies in the phylogeny, are possible.
Moreover, the inability to resolve relationships is often shown as a (soft)
polytomy. Both types of polytomies have been traditionally studied in the
context of gene tree reconstruction from sequence data. However, polytomies in
the species tree cannot be detected or ruled out without considering gene tree
discordance. In this paper, we describe a statistical test based on properties
of the multi-species coalescent model to test the null hypothesis that a branch
in an estimated species tree should be replaced by a polytomy. On both
simulated and biological datasets, we show that the null hypothesis is rejected
for all but the shortest branches, and in most cases, it is retained for true
polytomies. The test, available as part of the ASTRAL package, can help
systematists decide whether their datasets are sufficient to resolve specific
relationships of interest.
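A minimal sketch of how a quartet-frequency test of this kind might look. It assumes that, under the multi-species coalescent, a true polytomy makes the three quartet topologies around a branch equally likely, and it uses a Pearson chi-square goodness-of-fit test as an illustration. The function name and the count inputs are hypothetical, and this is not claimed to be the exact procedure implemented in ASTRAL.

```python
import math

def polytomy_p_value(n1, n2, n3):
    """Test the null hypothesis that three quartet topologies are equally frequent.

    n1, n2, n3 are (hypothetical) counts of gene tree quartets supporting
    each of the three possible topologies around one species tree branch.
    """
    total = n1 + n2 + n3
    if total == 0:
        raise ValueError("need at least one informative quartet count")
    expected = total / 3.0
    stat = sum((n - expected) ** 2 / expected for n in (n1, n2, n3))
    # With three categories there are 2 degrees of freedom, and the
    # chi-square survival function with 2 d.f. is exactly exp(-x / 2).
    return math.exp(-stat / 2.0)

# Balanced counts give no evidence against a polytomy;
# strongly skewed counts lead to rejection of the null.
balanced = polytomy_p_value(100, 100, 100)
skewed = polytomy_p_value(200, 50, 50)
```

A small p-value rejects the polytomy and supports keeping the branch; a large one means the data cannot rule the polytomy out.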
BBCA: Improving the Scalability of *BEAST Using Random Binning
Species tree estimation can be challenging in the presence of gene tree conflict due to incomplete lineage sorting (ILS), which can occur when the time between speciation events is short relative to the population size. Of the many methods that have been developed to estimate species trees in the presence of ILS, *BEAST, a Bayesian method that co-estimates the species tree and gene trees given sequence alignments on multiple loci, has generally been shown to have the best accuracy. However, *BEAST is extremely computationally intensive so that it cannot be used with large numbers of loci; hence, *BEAST is not suitable for genome-scale analyses. Results: We present BBCA (boosted binned coalescent-based analysis), a method that can be used with *BEAST (and other such co-estimation methods) to improve scalability. BBCA partitions the loci randomly into subsets, uses *BEAST on each subset to co-estimate the gene trees and species tree for the subset, and then combines the newly estimated gene trees together using MP-EST, a popular coalescent-based summary method. We compare time-restricted versions of BBCA and *BEAST on simulated datasets, and show that BBCA is at least as accurate as *BEAST, and achieves better convergence rates for large numbers of loci. Conclusions: Phylogenomic analysis using *BEAST is currently limited to datasets with a small number of loci, and analyses with even just 100 loci can be computationally challenging. BBCA uses a very simple divide-and-conquer approach that makes it possible to use *BEAST on datasets containing hundreds of loci. This study shows that BBCA provides excellent accuracy and is highly scalable.
Funding: Grant Agency of the Czech Republic P501-10-0208; Academy of Sciences of the Czech Republic AVOZ50040507, AVOZ50040702, MSMT LC0604; Ministry of Innovation and Science of Spain, MICINN CGL2007-64839-C02/BOS; CSIC (Superior Council of Scientific Investigations); José Castillejo Grant from the MICINN of the Spanish Government
Computer Science
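The random partitioning step described above can be sketched as follows. This is an illustrative sketch only: the function name, the `bin_size` parameter, and the locus labels are assumptions, and the co-estimation and summary steps (run with *BEAST and MP-EST in the abstract) are left as comments rather than invented calls.

```python
import random

def random_bins(loci, bin_size, seed=None):
    """Randomly partition loci into bins of at most bin_size loci each."""
    if bin_size < 1:
        raise ValueError("bin_size must be positive")
    shuffled = list(loci)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return [shuffled[i:i + bin_size] for i in range(0, len(shuffled), bin_size)]

loci = [f"locus{i}" for i in range(100)]  # hypothetical locus names
bins = random_bins(loci, bin_size=25, seed=1)
# Each bin would then be analyzed separately by a co-estimation method,
# and the gene trees from all bins pooled as input to a summary method.
```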
Optimal Subtree Prune and Regraft for Quartet Score in Sub-Quadratic Time
Finding a tree with the minimum total distance to a given set of trees (the median tree) is increasingly needed in phylogenetics. Defining tree distance as the number of induced four-taxon unrooted (i.e., quartet) trees with different topologies, the median of a set of gene trees is a statistically consistent estimator of the species tree under several models of gene tree species tree discordance. Because of this, median trees defined with quartet distance are widely used in practice for species tree inference. Nevertheless, the problem is NP-Hard and the widely-used solutions are heuristics. In this paper, we pave the way for a new type of heuristic solution to this problem. We show that the optimal place to add a subtree of size m onto a tree with n leaves can be found in time that grows quasi-linearly with n and is nearly independent of m. This algorithm can be used to perform subtree prune and regraft (SPR) moves efficiently, which in turn enables the hill-climbing heuristic search for the optimal tree. In exploratory experiments, we show that our algorithm can improve the quartet score of trees obtained using the existing widely-used methods.
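To make the quartet distance defined above concrete, here is a brute-force sketch that enumerates every set of four leaves. It assumes trees are given as nested Python tuples with string leaf labels (a hypothetical representation) and runs in roughly O(n^4) time, so it only illustrates the definition and is unrelated to the sub-quadratic algorithm the abstract describes.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set of tree, leaf sets of all internal nodes)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_topology(clade_list, quartet):
    """Return the induced split as a set of two pairs, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:  # this clade separates two leaves from the other two
            return frozenset([inside, q - inside])
    return None

def quartet_distance(tree1, tree2):
    """Count four-leaf subsets whose induced topologies differ."""
    leaves1, clades1 = clades(tree1)
    leaves2, clades2 = clades(tree2)
    if leaves1 != leaves2:
        raise ValueError("trees must share the same leaf set")
    return sum(
        quartet_topology(clades1, q) != quartet_topology(clades2, q)
        for q in combinations(sorted(leaves1), 4)
    )

t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
d = quartet_distance(t1, t2)  # quartets {a,b,c,d} and {a,b,c,e} differ
```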
Ultra-large alignments using Phylogeny-aware Profiles
Many biological questions, including the estimation of deep evolutionary
histories and the detection of remote homology between protein sequences, rely
upon multiple sequence alignments (MSAs) and phylogenetic trees of large
datasets. However, accurate large-scale multiple sequence alignment is very
difficult, especially when the dataset contains fragmentary sequences. We
present UPP, an MSA method that uses a new machine learning technique - the
Ensemble of Hidden Markov Models - that we propose here. UPP produces highly
accurate alignments for both nucleotide and amino acid sequences, even on
ultra-large datasets or datasets containing fragmentary sequences. UPP is
available at https://github.com/smirarab/sepp.
Comment: Online supplemental materials and data are available at
http://www.cs.utexas.edu/users/phylo/software/upp
MRL and SuperFine+MRL: new supertree methods
Background: Supertree methods combine trees on subsets of the full taxon set together to produce a tree on the entire set of taxa. Of the many supertree methods, the most popular is MRP (Matrix Representation with Parsimony), a method that operates by first encoding the input set of source trees by a large matrix (the "MRP matrix") over {0,1,?}, and then running maximum parsimony heuristics on the MRP matrix. Experimental studies evaluating MRP in comparison to other supertree methods have established that for large datasets, MRP generally produces trees of equal or greater accuracy than other methods, and can run on larger datasets. A recent development in supertree methods is SuperFine+MRP, a method that combines MRP with a divide-and-conquer approach, and produces more accurate trees in less time than MRP. In this paper we consider a new approach for supertree estimation, called MRL (Matrix Representation with Likelihood). MRL begins with the same MRP matrix, but then analyzes the MRP matrix using heuristics (such as RAxML) for 2-state Maximum Likelihood. Results: We compared MRP and SuperFine+MRP with MRL and SuperFine+MRL on simulated and biological datasets. We examined the MRP and MRL scores of each method on a wide range of datasets, as well as the resulting topological accuracy of the trees. Our experimental results show that MRL, coupled with a very good ML heuristic such as RAxML, produced more accurate trees than MRP, and MRL scores were more strongly correlated with topological accuracy than MRP scores. Conclusions: SuperFine+MRP, when based upon a good MP heuristic, such as TNT, produces among the best scores for both MRP and MRL, and is generally faster and more topologically accurate than other supertree methods we tested.
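The matrix encoding mentioned in the background can be illustrated with a short sketch. It assumes rooted source trees given as nested Python tuples with string taxon labels (a hypothetical input format) and emits one column per non-trivial clade: `1` for taxa inside the clade, `0` for taxa in the source tree but outside the clade, and `?` for taxa absent from that source tree. The parsimony or likelihood search on the resulting matrix is not shown.

```python
def clades(tree):
    """Return (leaf set of tree, leaf sets of all internal nodes)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def mrp_matrix(source_trees):
    """Encode source trees as a dict mapping each taxon to its row over {0,1,?}."""
    parsed = [clades(tree) for tree in source_trees]
    taxa = sorted(set().union(*(leaves for leaves, _ in parsed)))
    columns = []
    for leaves, found in parsed:
        for clade in found:
            if clade == leaves:
                continue  # the full leaf set carries no grouping information
            columns.append({
                t: "1" if t in clade else "0" if t in leaves else "?"
                for t in taxa
            })
    return {t: "".join(col[t] for col in columns) for t in taxa}

trees = [(("a", "b"), "c"), (("b", "d"), "c")]
matrix = mrp_matrix(trees)
```

For these two source trees, taxon `a` is missing from the second tree and taxon `d` from the first, which is where the `?` entries come from.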
Weighted Statistical Binning: enabling statistically consistent genome-scale phylogenetic analyses
Because biological processes can make different loci have different
evolutionary histories, species tree estimation requires multiple loci from
across the genome. While many processes can result in discord between gene
trees and species trees, incomplete lineage sorting (ILS), modeled by the
multi-species coalescent, is considered to be a dominant cause for gene tree
heterogeneity. Coalescent-based methods have been developed to estimate species
trees, many of which operate by combining estimated gene trees, and so are
called summary methods. Because summary methods are generally fast, they have
become very popular techniques for estimating species trees from multiple loci.
However, recent studies have established that summary methods can have reduced
accuracy in the presence of gene tree estimation error, and also that many
biological datasets have substantial gene tree estimation error, so that
summary methods may not be highly accurate on biologically realistic
conditions. Mirarab et al. (Science 2014) presented the statistical binning
technique to improve gene tree estimation in multi-locus analyses, and showed
that it improved the accuracy of MP-EST, one of the most popular
coalescent-based summary methods. Statistical binning, which uses a simple
statistical test for combinability and then uses the larger sets of genes to
re-calculate gene trees, has good empirical performance, but using statistical
binning within a phylogenomics pipeline does not have the desirable property of
being statistically consistent. We show that weighting the recalculated gene
trees by the bin sizes makes statistical binning statistically consistent under
the multispecies coalescent, and maintains the good empirical performance.
Thus, "weighted statistical binning" enables highly accurate genome-scale
species tree estimation, and is also statistically consistent under the
multi-species coalescent model.
Comment: (1) In Press, PLoS ON
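One simple way to read "weighting the recalculated gene trees by the bin sizes" is to repeat each bin's tree once per gene in that bin before passing the collection to a summary method. The sketch below assumes that reading; the function name, the bin identifiers, and the Newick strings are hypothetical, and the binning and tree re-estimation steps themselves are not shown.

```python
def weighted_gene_trees(bins, supergene_trees):
    """Repeat each bin's recalculated tree once per gene in the bin.

    bins: dict mapping a bin id to the list of genes placed in that bin.
    supergene_trees: dict mapping a bin id to the tree (e.g. a Newick
    string) recalculated from the combined genes of that bin.
    """
    weighted = []
    for bin_id, genes in bins.items():
        weighted.extend([supergene_trees[bin_id]] * len(genes))
    return weighted

bins = {"bin1": ["g1", "g2", "g3"], "bin2": ["g4"]}
supergene_trees = {"bin1": "((A,B),(C,D));", "bin2": "((A,C),(B,D));"}
summary_input = weighted_gene_trees(bins, supergene_trees)
```

With these inputs the summary method would see the tree for `bin1` three times and the tree for `bin2` once, so larger bins contribute proportionally more.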